Method and equipment for determining common subsequence of text strings

ABSTRACT

A method for determining a longest common subsequence in a plurality of text strings. The method comprises: separately converting a plurality of text strings into word sequences (S100); classifying the word sequences (S400); and performing longest common subsequence computation on every class (S500). Classifying the text strings first reduces the time needed for the LCS computation.

CROSS REFERENCE TO RELATED APPLICATION(S)

This application is the U.S. national stage application under 35 U.S.C. § 371 of International Patent Application No. PCT/CN2016/099631, filed on Sep. 21, 2016, which claims the benefit of priority under 35 U.S.C. § 119 of China Patent Application No. 201510685864.6, filed on Oct. 21, 2015, the contents of each of which are hereby incorporated by reference in their entireties.

FIELD OF THE INVENTION

The present invention relates to the field of computer applications, and in particular to a method and equipment for determining a common subsequence of text strings.

BACKGROUND OF THE INVENTION

Nowadays, people pay more and more attention to network security, and various security devices including firewalls are widely used. However, deploying security devices alone is not enough to protect the network; relevant personnel also need to continuously monitor and analyse the logs generated by the security devices because the logs contain very valuable information. For example, they can use the logs to detect security threats such as network intrusions, virus attacks, abnormal behaviours, and abnormal traffic, so as to selectively configure and adjust the overall network security strategy.

One way to analyse logs is to classify log events into several categories such as “information”, “error”, and “warning”. This method of analysis has limitations. Due to the large number and complexity of logs, important event information is likely to be submerged in the “warning” category and not processed in a timely manner. Therefore, in order to facilitate statistics, detect problems in a timely manner and avoid submerging small events of one type in other events of the same type, the logs need to be subdivided so that the type of event can be determined from the logs and processed accordingly.

Logs come in different formats depending on their text and source. For example, there are differences between the formats of logs from firewalls and web servers. In addition, the logs can still be subdivided according to their meanings, even if the sources are the same.

The conventional method of subdividing the logs is to calculate the longest common subsequence (LCS), that is, to merge two log texts together and extract the common sequence part, so as to determine whether the two can be classified into one category. However, this conventional method only supports two texts. In the case of a plurality of log texts, the LCS needs to be calculated for every pair of texts, resulting in a very large amount of computation.

SUMMARY OF THE INVENTION

According to one aspect of the present invention, a method for determining the longest common subsequence among a plurality of text strings is provided, which comprises: converting a plurality of the text strings into word sequences respectively; converting the word sequences into corresponding word sets respectively; calculating the minimum hash value for each word set; classifying the word sequences according to the minimum hash value; and performing the longest common subsequence operation in each category.

Here and in the following, the term “word sequence” refers to a sequence of words; correspondingly, the term “word set” refers to a set of words. Namely, the constituent elements of the sequence and the set are all words. The difference between the two is that the elements in a sequence can be repeated and are ordered, whereas in a set the order of the elements is not considered and the elements are not repeated.

According to another aspect of the present invention, an equipment for determining the longest common subsequence among a plurality of text strings is provided, which comprises: a first conversion device for converting a plurality of the text strings into word sequences respectively; a second conversion device for converting the word sequences into corresponding word sets respectively; a first operation device for calculating a minimum hash value for each word set; a classification device for classifying the word sequences into categories according to the minimum hash value; and a second operation device for performing the longest common subsequence operation in each category.

The embodiments of the present invention may include one or more of the following features.

Two word sequences with a minimum hash distance less than a first threshold are classified into the same category.

The longest common subsequence operation includes: selecting a word sequence in the category as the first word sequence, and respectively calculating the longest common subsequences of the first word sequence with other word sequences in the category until the length of the longest common subsequence obtained is greater than a second threshold.

The longest common subsequence operation includes: deleting the first word sequence from the category if all the lengths of the longest common subsequences obtained are not greater than the second threshold, and continuing the longest common subsequence operation.

The longest common subsequence having a length greater than the second threshold is determined as a text string template.

The text string template is used to calculate the longest common subsequence in turn with other word sequences in the category. During the calculation, the longest common subsequence having a length greater than the second threshold is determined as a new text string template and the calculation is continued.

The final text string template is output, and the word sequence in the category which can match the final text string template is deleted.

The longest common subsequence operation is continued until the category is empty.

Some embodiments of the present invention may have one or more of the following benefits: compared with conventional LCS algorithms, multiple texts are supported, and a minimum hash algorithm is used to quickly determine whether the differences between the texts are too large, thereby effectively saving the time required for the LCS operation.

Other aspects, features, and advantages of the present invention will be further clarified in the detailed description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be further described below with reference to the accompanying drawings.

FIG. 1 is a flowchart of a method for determining the longest common subsequence in a plurality of text strings according to the present invention;

FIG. 2 and FIG. 3 are flowcharts of performing the longest common subsequence operation and determining a text string template according to an embodiment; and

FIG. 4 is a block diagram of an equipment for determining the longest common subsequence among a plurality of text strings according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Referring to FIG. 1, in step S100, the text strings are converted into corresponding word sequences respectively. The following further describes step S100 by way of example using text strings A and B. Assume that the text string A is: “the quick brown fox jumps over the lazy dog”; the text string B is: “the lazy brown dog jumps over the quick fox”.

The text string A undergoes word segmentation to obtain a word sequence A: {the, quick, brown, fox, jumps, over, the, lazy, dog}. The text string B undergoes word segmentation to obtain a word sequence B: {the, lazy, brown, dog, jumps, over, the, quick, fox}.
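As a minimal sketch of step S100 for whitespace-delimited text such as strings A and B, the segmentation can be illustrated in Python as follows. The function name to_word_sequence is hypothetical, and splitting on whitespace is an assumption made only for this example; it does not cover Chinese text, which needs a dedicated segmenter.

```python
def to_word_sequence(text):
    """Split a text string into a word sequence (order and repeats kept).

    Assumption: words are separated by whitespace.
    """
    return text.split()


text_a = "the quick brown fox jumps over the lazy dog"
text_b = "the lazy brown dog jumps over the quick fox"

sequence_a = to_word_sequence(text_a)
sequence_b = to_word_sequence(text_b)
print(sequence_a)
print(sequence_b)
```

Both sequences contain the word "the" twice, because a word sequence keeps repeated words in their original positions.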

In addition to the text strings based on the Latin alphabet in the above example, the object for word segmentation may also include Chinese text strings, and the schemes for supporting Chinese word segmentation include, for example, CRF, MMSEG, and the like. A simple method of word segmentation is to look up a Chinese word database. For example, using a thesaurus comprising the Chinese words for “China”, “People” and “Republic”, a Chinese-English mixed phrase consisting of “how to translate” followed by those three Chinese words can be subdivided into six words: ‘how’, ‘to’, ‘translate’, and the three Chinese words.

The segmentation affects the elemental composition and the length of the word sequence. The longer the word sequence, the longer the time required for the subsequent execution of the LCS algorithm. However, it should be pointed out that apart from the speed of the LCS operation, the word segmentation basically has no effect on the results of the entire algorithm.

According to step S200, the word sequences are converted into corresponding word sets respectively. Again, the word sequences A and B are used as examples; in the conversion process, only one occurrence of each recurring word, such as “the”, is retained. After conversion, word set A is [the, quick, brown, fox, jumps, over, lazy, dog]; word set B is [the, lazy, brown, dog, jumps, over, quick, fox].

In the case where there are multiple text strings, all the text strings can be converted into corresponding word sets respectively according to the above steps.
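The conversion of step S200 can be sketched in Python with the built-in set type. The function name to_word_set is hypothetical; the sketch only illustrates that repeated words collapse and that order no longer matters.

```python
def to_word_set(word_sequence):
    """Convert a word sequence into a word set (no repeats, no order)."""
    return set(word_sequence)


sequence_a = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
sequence_b = ["the", "lazy", "brown", "dog", "jumps", "over", "the", "quick", "fox"]

set_a = to_word_set(sequence_a)
set_b = to_word_set(sequence_b)

# The two sequences differ, but their word sets are identical.
print(sequence_a == sequence_b, set_a == set_b)
```

The nine-word sequences each reduce to a set of eight words, and the two sets compare equal even though the sequences do not.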

According to step S300, a minimum hash (MinHash) value of each word set is calculated, and the minimum hash value is used to determine the similarity of two sets. A variety of methods for calculating MinHash are known. Shown below is one example implemented in Python.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-


    def minhash(data, hashfuncs):
        """Compute a MinHash signature matrix.

        data: 0/1 matrix whose rows are elements and whose columns are sets.
        hashfuncs: functions that map a row index to a hash value.
        """
        DEBUG = False
        rows, cols, sigrows = len(data), len(data[0]), len(hashfuncs)
        # Build every signature row separately; [[x] * cols] * sigrows
        # would make all rows share one list object.
        sigmatrix = []
        for i in range(sigrows):
            sigmatrix.append([10000000] * cols)
        for r in range(rows):
            hashvalue = [h(r) for h in hashfuncs]
            if DEBUG:
                print(hashvalue)
            for c in range(cols):
                if DEBUG:
                    print('-' * 2, r, c)
                if data[r][c] == 0:
                    continue
                # Keep the smallest hash value seen for each set (column).
                for i in range(sigrows):
                    if DEBUG:
                        print('-' * 4, i, sigmatrix[i][c], hashvalue[i])
                    if sigmatrix[i][c] > hashvalue[i]:
                        sigmatrix[i][c] = hashvalue[i]
                    if DEBUG:
                        print('-' * 4, sigmatrix)
            if DEBUG:
                for row in sigmatrix:
                    print(row)
                print('=' * 30)
        return sigmatrix


    if __name__ == '__main__':
        def hash1(x):
            return (x + 1) % 5

        def hash2(x):
            return (3 * x + 1) % 5

        data = [[1, 0, 0, 1],
                [0, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 1],
                [0, 0, 1, 0]]
        print(minhash(data, [hash1, hash2]))

In step S400, the MinHash distance of any two word sets is calculated. The MinHash is a fixed-length value. Assuming the length is 64 bits, the number of bit positions at which two MinHash values differ is the MinHash distance of the two values. Here, a short MinHash distance between two sets is a necessary but not sufficient condition for the similarity of the two sets. This is because MinHash itself is a probabilistic method with false positives. In addition, the MinHash considers only the element set and ignores the order in which the elements appear. Taking the text strings A and B as examples, although the two text strings are different, the corresponding word sets have the same MinHash value.
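The bit-difference distance described above can be sketched in Python as follows. The helper names word_hash64, minhash64 and minhash_distance are hypothetical, and deriving a 64-bit value from the first eight bytes of a SHA-1 digest is an assumption made only for illustration; the description does not prescribe a particular hash function.

```python
import hashlib


def word_hash64(word):
    """Map a word to a 64-bit integer (assumed hash: first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash64(word_set):
    """Return the smallest 64-bit hash over all words in a non-empty set."""
    return min(word_hash64(word) for word in word_set)


def minhash_distance(value_1, value_2):
    """Count the bit positions at which two MinHash values differ."""
    return bin(value_1 ^ value_2).count("1")


set_a = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
set_b = {"the", "lazy", "brown", "dog", "jumps", "over", "quick", "fox"}
print(minhash_distance(minhash64(set_a), minhash64(set_b)))
```

Because word sets A and B contain the same words, they produce the same value and a distance of zero, regardless of which hash function is chosen.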

The MinHash distance is compared with a first threshold, and the word sequences corresponding to two word sets whose MinHash distance is less than the first threshold are classified into the same category. The first threshold is adjustable, and its default value can be set as 80% of the number of bits of the MinHash. Through step S400, the word sequences corresponding to all text strings are classified into one or more categories.
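One possible way to turn the pairwise threshold test into categories is sketched below in Python. The function name classify is hypothetical, and comparing each new value only against the first member of each existing category is an assumption of this sketch; the description does not state how categories are built from pairwise comparisons.

```python
def classify(minhash_values, bits=64, ratio=0.8):
    """Group indices of MinHash values into categories.

    Assumption: a value joins the first category whose first member lies
    within the first threshold (ratio * bits); otherwise it starts a new one.
    """
    threshold = ratio * bits
    categories = []  # each category is a list of indices into minhash_values
    for index, value in enumerate(minhash_values):
        for members in categories:
            representative = minhash_values[members[0]]
            distance = bin(value ^ representative).count("1")
            if distance < threshold:
                members.append(index)
                break
        else:
            # No existing category was close enough.
            categories.append([index])
    return categories


# Toy 4-bit values so that the threshold is 0.8 * 4 = 3.2 differing bits.
print(classify([0b0000, 0b0001, 0b1111], bits=4))
```

In the toy call, the first two values differ in one bit and share a category, while the third differs from the first in all four bits and forms its own category.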

According to step S500, the longest common subsequence operation is performed in each category. A variety of methods of LCS operations are known. Shown below is one example implemented in Java.

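    public class LCSProblem {
        public static void main(String[] args) {
            // Index 0 holds an empty entry so that the sequences are 1-based.
            String[] x = {"", "A", "B", "C", "B", "D", "A", "B"};
            String[] y = {"", "B", "D", "C", "A", "B", "A"};
            int[][] b = getLength(x, y);
            Display(b, x, x.length - 1, y.length - 1);
        }

        // Fills the length table c and returns the direction table b.
        public static int[][] getLength(String[] x, String[] y) {
            int[][] b = new int[x.length][y.length];
            int[][] c = new int[x.length][y.length];
            for (int i = 1; i < x.length; i++) {
                for (int j = 1; j < y.length; j++) {
                    // Compare string contents rather than object references.
                    if (x[i].equals(y[j])) {
                        c[i][j] = c[i - 1][j - 1] + 1;
                        b[i][j] = 1;
                    } else if (c[i - 1][j] >= c[i][j - 1]) {
                        c[i][j] = c[i - 1][j];
                        b[i][j] = 0;
                    } else {
                        c[i][j] = c[i][j - 1];
                        b[i][j] = -1;
                    }
                }
            }
            return b;
        }

        // Walks the direction table back from (i, j) and prints the LCS.
        public static void Display(int[][] b, String[] x, int i, int j) {
            if (i == 0 || j == 0) {
                return;
            }
            if (b[i][j] == 1) {
                Display(b, x, i - 1, j - 1);
                System.out.print(x[i] + " ");
            } else if (b[i][j] == 0) {
                Display(b, x, i - 1, j);
            } else if (b[i][j] == -1) {
                Display(b, x, i, j - 1);
            }
        }
    }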

Step S500 will be described in detail below with reference to FIG. 2.

Referring to FIG. 2, according to steps S502 to S506, two word sequences are selected from the same category as the first word sequence and the second word sequence, and the longest common subsequence of the first word sequence and the second word sequence is calculated. It should be pointed out that here and in the following, word sequences can be selected arbitrarily from the qualifying set.

According to step S508, the length of the calculated LCS is compared with a second threshold. If the LCS length is greater than the second threshold, the LCS is converted into a text string template. Here, the second threshold is adjustable, and its default value can be set as 80% of the greater of the length of the first word sequence and the length of the second word sequence. Namely, the ratio of the length of the LCS to the greater of the length of the first word sequence and the length of the second word sequence should be greater than 80%.

If the length of the calculated LCS is not greater than the second threshold, the process returns to step S504 to replace the current second word sequence with another word sequence in the same category, and steps S506 and S508 are repeated. Here, the other word sequence is selected from the word sequences in the category that have not participated in the LCS operation within the operation cycle of the current first word sequence.

Steps S504 to S508 are repeated until a text string template is generated. If it is impossible to generate a template by exhausting all the word sequences in the category (step S504), then the current first word sequence is deleted, and the above steps are repeated after selecting another word sequence from the same category as the new first word sequence.
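The search of steps S502 to S508 can be sketched in Python as follows. The function names lcs_length and find_seed_pair are hypothetical, and always taking the first remaining word sequence as the first word sequence is an assumption of this sketch, since the description allows an arbitrary choice.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word sequences."""
    previous = [0] * (len(b) + 1)
    for word_a in a:
        current = [0]
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def find_seed_pair(category, ratio=0.8):
    """Return the first (first, second) pair whose LCS exceeds the threshold.

    A first word sequence that pairs with nothing is dropped, and the
    next one is tried. Returns None when no pair qualifies.
    """
    remaining = list(category)
    while len(remaining) >= 2:
        first = remaining[0]
        for second in remaining[1:]:
            threshold = ratio * max(len(first), len(second))
            if lcs_length(first, second) > threshold:
                return first, second
        remaining.pop(0)  # no template possible with this first sequence
    return None


category = [list("XYZ"), list("ABADEFG"), list("ABBCEFG"), list("ABBDEFG")]
print(find_seed_pair(category))
```

In this example the sequence X, Y, Z is discarded because it shares nothing with the others, and the pair found consists of the two seven-word sequences whose LCS has length 6.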

The process of generating a text string template is further described below with reference to a specific example of the first word sequence and the second word sequence. For the sake of simplicity, it is assumed that each word in the first and second word sequences is a single letter.

In one case, it is supposed that the first word sequence is {A, B, A, D, E, F, G} and the second word sequence is {A, B, B, D, E, F, G}. After the LCS operation, the LCS of the first word sequence and the second word sequence is {A, B, D, E, F, G} and the length is 6. Since the lengths of the first and second word sequences are both 7, the ratio is 6/7, or about 86%, which is greater than the default second threshold of 80%. Therefore, the LCS can be converted into a template {A, B, *, D, E, F, G}, wherein “*” is a placeholder word, meaning that there is at most one word between the words “B” and “D”. The placeholder can also use other symbols. To avoid ambiguity, special words that do not appear in the input text are often used.

In another case, it is supposed that the first word sequence is {A, B, D, E, F, G} and the second word sequence is {A, B, B, D, E, F, G}. After the LCS operation, the LCS of the first word sequence and the second word sequence is {A, B, D, E, F, G} and the length is 6. Since the greater of the lengths of the first and the second word sequences is 7, the ratio is again 6/7, or about 86%, which is greater than the default second threshold of 80%. Therefore, it is also possible to convert this LCS into a template {A, B, *, D, E, F, G}, wherein “*” is a placeholder word, meaning that there is at most one word between the words “B” and “D”.

In another case, the first word sequence is {A, B, A, D, E, F, G} and the second word sequence is {A, B, B, C, E, F, G}. After the LCS operation, the LCS of the first word sequence and the second word sequence is {A, B, E, F, G}, and the length is 5. Since the lengths of the first and second word sequences are both 7, the ratio is 5/7, or about 71%, which is less than the default second threshold of 80%. Therefore, the LCS cannot be converted to a template.

It should be understood that, depending on the actual length of the word sequence, the generated template may include a plurality of placeholder words “*”, each of which indicates that a maximum of one word can be inserted in its place.
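The three cases above can be reproduced with the following Python sketch. The function names lcs_alignment and make_template are hypothetical. Two points are assumptions of this sketch rather than statements of the description: the LCS is aligned as far to the left as possible, and a gap of several unmatched words is represented by as many placeholders as the longer of the two gaps.

```python
def lcs_alignment(a, b):
    """Return index pairs (i, j) of one LCS of a and b, leftmost first."""
    n, m = len(a), len(b)
    # table[i][j] is the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def make_template(a, b, ratio=0.8, placeholder="*"):
    """Build a template from the LCS, or return None below the threshold."""
    pairs = lcs_alignment(a, b)
    if len(pairs) <= ratio * max(len(a), len(b)):
        return None
    template, prev_i, prev_j = [], -1, -1
    # The final sentinel pair accounts for unmatched trailing words.
    for i, j in pairs + [(len(a), len(b))]:
        gap = max(i - prev_i - 1, j - prev_j - 1)
        template.extend([placeholder] * gap)
        if i < len(a):
            template.append(a[i])
        prev_i, prev_j = i, j
    return template


print(make_template(list("ABADEFG"), list("ABBDEFG")))
print(make_template(list("ABDEFG"), list("ABBDEFG")))
print(make_template(list("ABADEFG"), list("ABBCEFG")))
```

The first two calls yield the template A, B, *, D, E, F, G, and the third returns None because an LCS of length 5 does not exceed 80% of 7.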

After the text string template is generated, according to steps S510 to S512 in FIG. 3, another word sequence is selected and the LCS operation is performed on it together with the text string template. Similar to step S502, the other word sequence is selected from the word sequences in the category that have not participated in the LCS operation within the operation cycle of the current first word sequence. For example, if the LCS calculated from the first and second word sequences has been converted into a text string template, a word sequence in the category other than the current first and second word sequences is selected for the LCS operation with the text string template.

According to step S512, the length of the calculated LCS is compared with the second threshold. If the LCS length is greater than the second threshold, the LCS is converted into a new text string template.

If the length of the calculated LCS is not greater than the second threshold, the process returns to step S510.

Steps S510 to S514 are repeated until all word sequences in this category are exhausted.

According to step S516, the text string template is output, and all the word sequences that the text string template can match are deleted from the category. Alternatively, the word sequences that the text string template can match can also be deleted each time a text string template is obtained.
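The deletion in step S516 depends on what it means for a template to match a word sequence, which the description leaves open. The Python sketch below assumes that a sequence matches when it equals the template with each placeholder standing for zero or one word. The function name matches and the sample data are hypothetical.

```python
def matches(template, words, placeholder="*"):
    """Check whether a word sequence matches a template.

    Assumption: each placeholder stands for zero or one arbitrary word.
    """
    # reachable[j] is True when the template prefix processed so far
    # can account for exactly the first j words.
    reachable = [True] + [False] * len(words)
    for token in template:
        following = [False] * (len(words) + 1)
        for j, ok in enumerate(reachable):
            if not ok:
                continue
            if token == placeholder:
                following[j] = True  # placeholder stands for no word
                if j < len(words):
                    following[j + 1] = True  # or for exactly one word
            elif j < len(words) and words[j] == token:
                following[j + 1] = True
        reachable = following
    return reachable[len(words)]


template = list("AB*DEFG")
category = [list("ABADEFG"), list("ABBCEFG"), list("ABDEFG"), list("ABBDEFG")]

# Delete every word sequence that the template can match.
remaining = [words for words in category if not matches(template, words)]
print(remaining)
```

Only the sequence A, B, B, C, E, F, G survives the deletion here, so the next cycle of the LCS operation would start from a category holding that single sequence.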

The process returns to step S502 and repeats until the category is empty, that is, until all word sequences in the category have been deleted.

Similarly, the LCS and text string template operations are performed on other categories until all categories are empty.

The equipment 400 for determining the longest common subsequence among a plurality of text strings shown in FIG. 4 comprises a first conversion device 402, a second conversion device 404, a first operation device 406, a classification device 408, and a second operation device 410. Among them, the first conversion device 402 is configured to convert a plurality of the text strings into word sequences respectively, the second conversion device 404 is configured to convert the word sequences into corresponding word sets respectively, the first operation device 406 is configured to calculate the minimum hash value of each word set, the classification device 408 is configured to classify the word sequences according to the minimum hash value, and the second operation device 410 is configured to perform the longest common subsequence operation in each category.

The functional modules of the equipment 400 may be implemented by hardware, software, or a combination of hardware and software to perform the above-described method steps according to the present invention. In addition, the first conversion device 402, the second conversion device 404, the first operation device 406, the classification device 408, and the second operation device 410 may be combined or further decomposed into sub-modules so as to execute the above-described method steps according to the present invention. Therefore, any possible combination, decomposition or further definition of the above functional modules would fall within the scope of protection of the claims.

The present invention is not limited to the specific description as elaborated above, and any changes that are readily apparent to those skilled in the art on the basis of the above description are within the scope of the present invention.

The invention claimed is:
 1. A computer-implemented method for determining a longest common subsequence among a plurality of text strings, comprising: converting, by a first conversion device, a plurality of the text strings into word sequences respectively; converting, by a second conversion device, the word sequences into corresponding word sets respectively; calculating, by an operation device, a minimum hash value for each word set of the corresponding word sets; classifying, by a classification device, the word sequences into categories according to the minimum hash value; and performing, by the operation device, a longest common subsequence operation in each category, wherein two word sequences with a minimum hash distance less than a first threshold are classified into same category, wherein the longest common subsequence operation includes: selecting a word sequence in the same category as a first word sequence, and respectively calculating the longest common subsequence of the first word sequence with other word sequences in the same category until a length of the longest common subsequence obtained is greater than a second threshold, wherein the longest common subsequence having a length greater than the second threshold is determined as a text string template, wherein the text string template is used to calculate the longest common subsequence in turn with other word sequences in the same category, and during the calculation of the longest common subsequence, the longest common subsequence having a length greater than the second threshold is determined as a new text string template and the calculation of the longest common subsequence is continued.
 2. The method according to claim 1, wherein the longest common subsequence operation further includes: deleting the first word sequence from the category if all lengths of the longest common subsequences obtained are not greater than the second threshold, and continuing the longest common subsequence operation.
 3. The method according to claim 2, wherein the longest common subsequence having a length greater than the second threshold is determined as a text string template.
 4. The method according to claim 3, wherein the text string template is used to calculate the longest common subsequence in turn with other word sequences in the category, and during the calculation of the longest common subsequence, the longest common subsequence having a length greater than the second threshold is determined as a new text string template and the calculation of the longest common subsequence is continued.
 5. The method according to claim 4, wherein a final text string template is output, and the word sequence in the category which can match the final text string template is deleted.
 6. The method according to claim 5, wherein the longest common subsequence operation is continued until the category is empty.
 7. The method according to claim 1, wherein a final text string template is output, and a word sequence in the category which can match the final text string template is deleted.
 8. The method according to claim 7, wherein the longest common subsequence operation is continued until the category is empty.
 9. An apparatus for determining a longest common subsequence among a plurality of text strings, comprising a hardware wherein the hardware is configured to perform following operations including: converting, by a first conversion device, a plurality of the text strings into word sequences respectively; converting, by a second conversion device, the word sequences into corresponding word sets respectively; calculating, by the operation device, a minimum hash value for each word set of the corresponding word sets; classifying, by a classification device, the word sequences into categories according to the minimum hash value; and performing longest common subsequence operation in each category, wherein two word sequences with a minimum hash distance less than a first threshold are classified into same category, wherein the longest common subsequence operation includes: selecting a word sequence in the same category as a first word sequence, and respectively calculating the longest common subsequence of the first word sequence with other word sequences in the same category until a length of the longest common subsequence obtained is greater than a second threshold, wherein the longest common subsequence having a length greater than the second threshold is determined as a text string template, wherein the text string template is used to calculate the longest common subsequence in turn with other word sequences in the same category, and during the calculation of the longest common subsequence, the longest common subsequence having a length greater than the second threshold is determined as a next text string template and the calculation of the longest common subsequence is continued.